Clustering
Scaling methods
Clustering algorithms take a set of inputs and attempt to identify some latent “groups” in the data.
These data are assumed to be “unlabeled”: we don’t have specific groups in mind, or at least we haven’t labeled them beforehand.
Goal: find the set of “K” group assignments that minimizes the “within-cluster sum of squares”
K is any number between 1 and the sample size, and the researcher chooses it.
What about K = 3, or data with more than 2 features?
The k-means algorithm searches for cluster assignments automatically, alternating between assigning each point to its nearest center and recomputing the centers (it converges to a local, not necessarily global, minimum).
The within-cluster sum of squares shrinks at each iteration until the algorithm converges on a minimum.
We picked k=3 and two features for the sake of simplifying the display, but usually you’ll use more features and more clusters.
There are some heuristics for choosing an optimal value of "K", but it's often partly a judgment call.
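One common heuristic is the "elbow" plot: run kmeans for a range of values of K and plot the total within-cluster sum of squares against K. A minimal sketch on simulated data (the variable names and toy data are ours, not from the CHES example):

```r
set.seed(42)
# simulate three well-separated groups in two features
x <- rbind(matrix(rnorm(100, mean = 0),  ncol = 2),
           matrix(rnorm(100, mean = 4),  ncol = 2),
           matrix(rnorm(100, mean = -4), ncol = 2))

# total within-cluster sum of squares for K = 1..8
wcss <- sapply(1:8, function(k) kmeans(x, centers = k, nstart = 20)$tot.withinss)

# look for the "elbow" where adding more clusters stops paying off
plot(1:8, wcss, type = "b", xlab = "K", ylab = "Within-cluster sum of squares")
```

Here the curve should bend sharply at K = 3, since that is how the data were generated; with real data the elbow is rarely so clean.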
We'll load the ches data set and then use kmeans to perform K-means clustering on its principal components.

library(tidyverse)
library(readr)
ches<-read_csv('https://www.chesdata.eu/s/CHES_2024_final_v2.csv')
labels<-c("Radical Right",
"Conservatives",
"Liberal",
"Christian-Democratic",
"Socialist",
"Radical Left",
"Green",
"Regionalist",
"No family",
"Confessional",
"Agrarian/Center")
ches$family<-factor(ches$family, labels=labels)
country_levels<-c(1, 2, 3, 4, 5, 6, 7, 8, 10, 11, 12, 13, 14, 16, 20, 21, 22,
23, 24, 25, 26, 27, 28, 29, 31, 34, 35, 36, 37, 38, 40, 45)
country_labels<-c("Belgium", "Denmark", "Germany", "Greece", "Spain", "France",
"Ireland", "Italy", "Netherlands", "United Kingdom", "Portugal",
"Austria", "Finland", "Sweden", "Bulgaria", "Czech Republic",
"Estonia", "Hungary", "Latvia", "Lithuania","Poland", "Romania",
"Slovakia", "Slovenia", "Croatia", "Turkey", "Norway", "Switzerland",
"Malta", "Luxembourg", "Cyprus", "Iceland")
ches$country<-factor(ches$country, levels=country_levels, labels=country_labels)

Dimensionality reduction techniques take a set of variables or relationships and simplify or summarize them in a smaller number of dimensions.
Useful for:
- visualizing high-dimensional data in two or three dimensions
- pre-processing features for supervised models
Principal components analysis takes a matrix and spits out a new one of equal size where:
- each new column (a "component") is a weighted combination of the original columns
- the components are uncorrelated with one another
- the components are ordered by how much of the original variance they explain
Variations on PCA can also be used to infer similarities or differences between legislators or countries using roll-call votes.
Make a N x N matrix that counts how many times each member voted together
Calculate the Euclidean "distance" between each pair of legislators
Use PCA on the distance matrix and take the first K dimensions
|   | A | B | C | D |
|---|---|---|---|---|
| A | 1 | 1 | 1 | 0 |
| B | 1 | 1 | 0 | 0 |
| C | 0 | 0 | 1 | 0 |
| D | 1 | 1 | 1 | 1 |
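The three steps above can be sketched in R. Treating a small binary matrix like the one in the table as legislators (rows) by votes (columns) — the object names and toy values here are ours:

```r
# rows = legislators, columns = individual votes (1 = yea, 0 = nay)
votes <- rbind(A = c(1, 1, 1, 0),
               B = c(1, 1, 0, 0),
               C = c(0, 0, 1, 0),
               D = c(1, 1, 1, 1))

# 1. N x N agreement matrix: how many times each pair cast the same vote
agree <- tcrossprod(votes) + tcrossprod(1 - votes)

# 2. Euclidean distance between each pair of legislators' agreement profiles
d <- dist(agree)

# 3. PCA on the distance matrix; keep the first K = 2 dimensions
coords <- prcomp(as.matrix(d))$x[, 1:2]
```

`tcrossprod(votes)` counts shared yeas and `tcrossprod(1 - votes)` counts shared nays, so their sum is the agreement count for each pair.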
For instance, here’s the result from scaling UN voting behavior from 2010 to 2019 and taking the first two components:
Poole and Rosenthal’s DW-Nominate scores use something similar to this approach
Finally, PCA can be used as a pre-processing step for supervised models to address the "curse of dimensionality" problem we talked about last class. Just be sure to "train" the PCA model on the training data and then "predict" on the testing data, just like you would with a supervised model.
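In R, that train/predict discipline uses prcomp and predict. A sketch with simulated stand-in features (the matrices and column names are ours):

```r
set.seed(7)
train <- matrix(rnorm(200), ncol = 4)  # stand-in training features
test  <- matrix(rnorm(40),  ncol = 4)  # stand-in testing features
colnames(train) <- colnames(test) <- paste0("f", 1:4)

# fit the PCA on the training data only
pca <- prcomp(train, center = TRUE, scale. = TRUE)

# project both sets with the training-set centers, scales, and rotation
train_pcs <- pca$x
test_pcs  <- predict(pca, newdata = test)
```

Calling predict reuses the centering, scaling, and rotation learned from the training set, so no information leaks from the test set into the transformation.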
Use scale to scale your features from the party cluster analysis.
Use prcomp to perform the PCA
Extract the first two principal components of the model
Plot your clusters using the PCA values, color-coded by cluster.
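One way the exercise steps fit together — a sketch using a simulated stand-in for your party features (replace feats with the CHES columns you actually clustered; all names here are ours):

```r
set.seed(3)
feats <- matrix(rnorm(300), ncol = 3)   # stand-in for your party features
colnames(feats) <- c("f1", "f2", "f3")

scaled <- scale(feats)                  # mean 0, sd 1 per column
km  <- kmeans(scaled, centers = 3, nstart = 20)
pca <- prcomp(scaled)

# plot the first two principal components, color-coded by cluster
plot(pca$x[, 1], pca$x[, 2],
     col = km$cluster, pch = 19,
     xlab = "PC1", ylab = "PC2")
```

Scaling first matters for both steps: kmeans and PCA are each sensitive to the units of the input features.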